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Traditional random grapli models of networks generate networks that are locally tree-like, meaning 
that all local neighborhoods take the form of trees. In this respect such models are highly unrealistic, 
most real networks having strongly non-tree-like neighborhoods that contain short loops, cliques, or 
other biconnected subgraphs. In this paper we propose and analyze a new class of random graph 
models that incorporates general subgraphs, allowing for non-tree-like neighborhoods while still 
remaining solvable for many fundamental network properties. Among other things we give solutions 
for the size of the giant component, the position of the phase transition at which the giant component 
appears, and percolation properties for both site and bond percolation on networks generated by 
the model. 
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I. INTRODUCTION 

It was pointed out by Rapoport in the 1940s 1] and 
more recently by Watts and Strogatz 2] that many ob- 
served networks contain a statistically surprising num- 
ber of triangles — sets of three vertices connected by three 
edges. More generally it has been noted that many small 
subgraphs occur in networks in numbers greater than one 
would expect purely on the basis of chance [J-Q . In some 
cases there are good reasons for the frequent occurrence 
of a particular subgraph. Social networks, for instance, 
display a lot of triangles because an individual's friends 
are likely themselves to be friends. In technological and 
biological networks certain subgraphs appear to perform 
modular tasks that contribute to the networks' overall 
operation 3] and hence may be evolutionarily favored. 
In other cases the reason for the appearance of the sub- 
graphs is unclear, but the empirical evidence for their 
presence is nonetheless convincing, and hence if we wish 
to model these networks accurately the appropriate sub- 
graphs must be included. 

Unfortunately, few practical models of complex net- 
works exist that include significant densities of arbitrary 
small subgraphs. Random graph models have been de- 
veloped that incorporate numerous features of real- world 
networks, including arbitrary degree distributions [6|,|7|, 
correlationsjS, ^] , bipartite and multi-partite structure, 
hierarchy [lO[, vertex ordering |ll| . and even geome- 
try [l2|, but until recently there were no equivalent ran- 
dom graph models incorporating nontrivial densities of 
general subgraphs. Exponential random graphs do allow 
general subgraphs in principle, but in practice they are 
difficult to analyze and moreover can display pathological 
behaviors that limit their usefulness. 

In this paper we introduce a class of models that build 
upon conventional random graphs and incorporate non- 
trivial densities of arbitrary subgraphs, while remaining 
exactly solvable in the limit of large network size for 
a variety of properties both local and global, including 
degree distributions, clustering coefficients, component 
sizes, percolation properties, and others. A number of 
new models have appeared in recent years that achieve 



similar goals in other contexts J13l - ll9l ] , and many of these 
are, as we will see, special cases of the general formalism 
introduced in this paper. We begin our discussion with 
a description of a special case of the formalism, that of 
a network containing only random edges and triangles, 
which was described previously in Refs. [lalS- Then, 
using this model as a starting point, we further develop 
the general foundations that will allow us to model net- 
works with any choice of subgraphs we please. 



II. A SIMPLE EXAMPLE 

Consider to begin with the standard Poisson random 
graph, famously studied by Erdos and Renyi in the 1950s 
and 60s ^20, 21,]. In this model each distinct pair of ver- 
tices, out of n vertices total, is (or is not) connected by 
an edge with independent probability p = c/n (or I — p), 
where c is a constant. A particular subgraph with n^ ver- 
tices and nis edges can occur in (^ ) ~ 71"=/^^! positions 
in such a graph, each with probability 



p"=(l-p)("20- 



(1) 



Thus the expected number of occurrences of such a sub- 
graph is 0(71"""™=). If the subgraph is connected then 
rUs ^ ng — l with the equality applying only in the case of 
a tree. Thus the number of connected subgraphs grows 
with system size only for the case of trees and for all other 
subgraphs is constant or decreasing, making the density 
of all such non-tree subgraphs vanish in the large-n limit. 
A similar argument applies to most of the extensions of 
the random graph. These graphs are thus "locally tree- 
like" — all small connected subsets of vertices within the 
network are trees. This means, for instance, that such 
graphs have a vanishing density of triangles in the limit 
of large size, in strong contrast to real networks, which 
frequently have a very high triangle density [22i] . 

Although it is unrealistic, the locally tree-like nature 
of random graphs is also the crucial feature that makes 
the model tractable. The calculations of the size of the 
giant component, the size distribution of small compo- 
nents, the percolation features of the network, and many 



other properties are all crucially dependent on the tree- 
like property. This is unfortunate, since the tree-like 
nature of the network is destroyed by the introduction 
of a finite density of any subgraph that contains one or 
more loops, which suggests that generalizations of ran- 
dom graph models containing such subgraphs may be in- 
trinsically unsolvable. As we show in this paper, however, 
this turns out not to be the case. By exploiting tree-like 
structure at a higher level, the level of the so-called "fac- 
tor graph," we can introduce arbitrary distributions of 
subgraphs into the network, including those containing 
loops, and still solve exactly for global properties of the 
network, even though the network is now explicitly not 
locally tree-like. As a first demonstration of the process, 
consider the following simple model, which we will call 
the "edge-triangle" model. 

The edge-triangle model was proposed previously 
in |16J, [l9| as a way to incorporate the phenomenon of 
clustering or transitivity into random graphs. It gen- 
eralizes the configuration model, the standard model 
of a network with arbitrary degree distribution [6|, |7|. 
In the configuration model one specifies the number of 
edges attached to each vertex i — the so-called "degree 
sequence" — as the fundamental parameters of the net- 
work. In the edge-triangle model one specifies instead 
the number U of triangles that each vertex participates 
in along with the number Si of "single edges," meaning 
edges that are not part of a triangle. One can picture 
vertex i as having Si "stubs" of edges emerging from it 
and ti corners of triangles. Then an edge-triangle net- 
work is generated by choosing stubs randomly in pairs 
and joining them to form complete edges until no stubs 
remain, and also choosing triangle corners in threes and 
joining them to form complete triangles until no corners 
remain. The end result is a network drawn uniformly 
at random from the set of all possible matchings of the 
stubs and corners, and the edge-triangle model is defined 
to be the ensemble of such matchings in which each ap- 
pears with equal probability. Note that to create a com- 
plete matching we require that the total number of stubs 
be a multiple of two and the total number of corners a 
multiple of three. 

The undirected networks generated by the edge- 
triangle model are clearly not locally tree-like, since they 
contain triangles. Yet one can still proceed with analytic 
calculations. Consider for instance the following calcu- 
lation of the size of the giant component (the extensive 
part of the network in which any vertex can reach any 
other via at least one path). 

Let us define the joint degree distribution p{s,t) for 
our network to be the fraction of vertices connected to s 
single edges and t triangles. As with other random graph 
models, it's helpful to define a generating function Go for 
this degree distribution: 



^0(2:1, 2:2) = ^p{s,t)z'li 



(2) 



bution." Excess degree is a property of the vertex one 
reaches by following an edge in a network and is normally 
defined to be the number of edges connected to such a 
vertex other than the edge one followed in the first place. 
In the edge-triangle model, where the "degree" of each 
vertex is denoted by the two numbers s and t, there are 
now two corresponding excess degrees. If one follows a 
single edge to reach a vertex then the excess degree is 
given by the number t of triangles attached to that ver- 
tex and the number s of single edges other than the edge 
via which we arrived. Similarly if one follows a triangle 
to reach a vertex then the excess degree is given by the 
number of single edges attached to that vertex and the 
number of triangles other than the one via which we ar- 
rived. It is straightforward to show that the distributions 
of these excess degrees are, respectively. 



/ X (i + 1) . 
r{s,t) = ^—y^p{s,t + l), 



(3) 
(4) 



where (s) and (i) are the average numbers of stubs and 
corners at a vertex in the network as a whole. The gen- 
erating functions for the excess degree distributions are 



Gi(zi,Z2) = ^q{s,t)zlzl, 

St 



(5) 
(6) 



The calculation of the giant component size now pro- 
ceeds as follows. Let ui be the probability that the vertex 
reached by following a single edge (an edge that is not 
part of a triangle) is not connected to the giant compo- 
nent by any of its other edges or triangles, and let U2 
be the probability that neither of the vertices reached 
by following a triangle is connected to the giant compo- 
nent by any of their other triangles or edges. If the ver- 
tex reached by following an edge is connected to s other 
edges and t triangles then the probability that none of 
them leads to the giant component is wfwl, where s and 
t are distributed according to the excess degree distri- 
bution g(s,i). Averaging over this distribution, we find 
that 



Ul 



Z^ 



q{s,t)u{u\ = Gi(ui,U2). 



(7) 



Similarly, if a vertex reached by following a triangle is 
is connected to t other triangles and s edges then the 
probability that none of them leads to the giant com- 
ponent is again ufu^, but with s and t now distributed 
according to r{s,t). Averaging over r{s,t) then gives an 
average probability of G2(ui, U2) and the total probabil- 
ity for both vertices reached via a triangle is the square 
of this quantity: 



Also important is the so-called "excess degree distri- 



U2 = [G2(wi,U2)]' 



(8) 



Finally, a randomly selected vertex with s stubs and 
t triangles is not in the giant component if none of 
its neighbors are in the giant component, which hap- 
pens with probability uju2 where s and t are distributed 
according to p{s,t). Averaging over p{s,t), we find 
the probability of not being in the giant component to 
be Go{ui, U2), and the probability S of being in the giant 
component is one minus this: 



S =1- Go(ui,U2)- 



(9) 



Between them Eqs. ([7]) to ([2]) give the size of the giant 
component in our edge-triangle network as a fraction of 
the size of the whole network. While they cannot always 
be solved analytically, they can be solved numerically by 
simple iteration: one makes an initial guess about the val- 
ues of Ml and U2 and iterates ([7]) and ([5]) to convergence, 
then substitutes the resulting values into (O- (The size 
of the giant component is only one example of a quantity 
that can be calculated within the edge-triangle model. 
For other examples, including clustering coefficient and 
percolation properties see Ref. [l6|.) 

The calculation of the giant component size in fact 
follows quite closely the method used for other, locally 
tree-like random graph models such as the configuration 
model [7| and does not appear significantly more complex 
despite the addition of triangles, which destroy the tree- 
like property. The reason for this is that, while the edge- 
triangle model is indeed not tree-like in a naive sense, 
it is still tree-like at a higher level, above the level of 
the triangles. Specifically, in the limit of large graph 
size, a finite-sized local neighborhood of a vertex in the 
edge-triangle model is a connected graph whose largest 
biconnected component (of which there can be many) is 
a triangle. This means that if we regard each triangle in 
the network as a single three- vertex unit (and each single 
edge as a two-vertex unit) then local neighborhoods are 
tree-like at the level of these units. We will develop this 
idea further shortly. 



III. A GENERAL MODEL 

The edge-triangle model provides a simple, solvable 
model of networks that contain triangles. It does, how- 
ever, have some disadvantages. In particular, the proba- 
bility that any two triangles connected to the same ver- 
tex will share an edge vanishes in the limit of large graph 
size, a direct consequence of the fact that the networks 
generated by the model are tree-like above the level of 
triangles. In real networks, by contrast, it is a common 
occurrence that triangles share an edge, and hence the 
model is unrealistic in this respect. 

One way to solve this problem would be to modify the 
model in some way to encourage triangles to share edges, 
but this is unsatisfactory because, in so doing, we would 
destroy the locally tree-like property of the network (at 
the triangle level) that allows its solution. Instead, there- 
fore, we propose an alternative approach: we consider 




FIG. 1: A small network made of single edges, triangles, and 
"diamond" subgraphs composed of two overlapping triangles. 



two triangles that share an edge to form a new network 
element, analogous to the triangle, but now with four 
vertices instead of three — see Fig. [T] Our approach is to 
create a model that introduces a specified distribution 
of these larger elements in exactly the same way that we 
previously introduced triangles. More generally, it is pos- 
sible to define a model in which we introduce arbitrary 
distributions of any subgraphs we please. (The possibil- 
ity of such a model was mentioned briefiy in Refs. [16[ 
and [i3-) As we will show, such models can always be 
viewed as tree-like at a suitable higher level and thereby 
solved exactly for properties both local and global in the 
limit of large size. 



A. Subgraphs and roles 

In the model we propose, we first specify a set of sub- 
graphs that will be added to the network. Three exam- 
ples of possible sets are shown in Fig. [51 The network 
will be created by specifying the number of each of the 
subgraphs attached to each vertex and then sampling 
randomly from the (usually large) set of compatible net- 
works. The edge-triangle model of Section |lT] is an ex- 
ample of such a model in which the set of subgraphs 
numbers just two — the single edge and the triangle as 
shown in Fig. [5}d. The model of this section generalizes 
the edge-triangle model to arbitrary subgraph sets of ar- 
bitrary size. 

This generalization, however, introduces an important 
new feature to the model that was not present in the 
edge-triangle model. It is not sufficient, in the general 
case, merely to specify how many copies of each subgraph 
are connected to each vertex because the vertices in the 
subgraphs can play more than one role. Consider the 
diamond-shaped subgraph of Fig. [^b. Two of the ver- 
tices in this subgraph have three incident edges while the 
others have two. Specifying only that a vertex belongs 
to such a subgraph is therefore ambiguous. We need to 
specify also which role the vertex plays (a point made 
previously by Miller |l9{). A vertex could, for example. 








FIG. 2: Three examples of possible subgraph sets for the 
model described in the text, (a) The subgraph set for the 
edge-triangle model of Section [III (b) The subgraph set for 
the model with overlapping triangles shown in Fig. [l] (c) A 
more extensive set that includes both singly and doubly con- 
nected subgraphs. 



participate in five diamonds, playing the three-edge role 
in, say, four of them, and the two edge role in the fifth. 

The concept of a role in a subgraph can be made precise 
by employing concepts from graph theory, specifically the 
concepts of automorphisms and orbits. Consider a sub- 
graph in which each vertex has a unique identifying label. 
An automorphism of the subgraph is a permutation of 
the labels such that the set of label pairs joined by edges 
remains unchanged. Consider Fig. [3l for instance, which 
shows the "diamond" subgraph from Fig. ^p with integer 
labels on its vertices. The four panels within the figure 
show four different automorphisms of the graph so that 
there is, for instance always an edge between vertices 1 
and 2 in each panel, regardless of the permutation of the 
labels. Formally, the set of all automorphisms of a sub- 
graph G forms a group, which is called the automorphism 
group and denoted Aut(G'). 

Now consider a particular vertex within our subgraph. 
The orbit of the vertex is defined to be the set of other 
vertices with which it shares at least one label under the 
label permutations of the automorphism group. The left- 
most vertex in the subgraph of Fig.[3l for example, shares 
labels 1 and 2 with the right-most vertex, but shares no 
labels with either of the other vertices. Hence the left 
and right vertices form an orbit. Similarly the top and 
bottom vertices form an orbit. The orbits correspond to 
the subgraph "roles" discussed above: they enumerate 
each of the topologically distinct situations in which a 
vertex can find itself. 

Unfortunately, the problem of calculating the orbits of 
the automorphism group of a graph is computationally 



FIG. 3: Vertices 1 and 2 in this diamond-shaped subgraph 
constitute a role as defined in the text, as do vertices 3 and 4. 



hard and no polynomial-time algorithm is known. Our 
focus here, however, is on relatively small subgraphs for 
which roles can be found quite simply by hand. 

Once subgraph roles are defined then we can specify 
completely the number of subgraphs in which each ver- 
tex participates and the manner in which it does so. We 
assign to each vertex a set of role indices di,d2, ■ ■ ■, speci- 
fying the number of times each vertex plays each possible 
role. For instance, in the subgraph set shown in Fig. [^b, 
there are four different roles — one for each of the first 
two subgraphs and two for the third subgraph. Thus 
each vertex would in this case be assigned four role in- 
dices di . . . (i4 giving the number of times it plays each of 
these roles. The set of role indices for a given vertex form 
a vector d with c non-negative integer elements, where c 
is number of different roles in the subgraph set. This vec- 
tor plays a part similar to that played by the degree in 
conventional network models and, by analogy with the 
degree sequence in such models, we will call the set of 
vectors for all vertices the role sequence. 

Just as with degree sequences in other models, not all 
role sequences correspond to possible networks. Consider 
again the diamond subgraph of Fig. [31 with its two roles. 
Because both roles appear two times in this subgraph, 
the total number of times vertices in a network partici- 
pate in each of the roles must be equal to the same mul- 
tiple of two. More generally, the total number of times 
vertices play a particular role in a particular subgraph 
must always be an multiple of the number of times that 
role appears in the subgraph, and moreover that multi- 
ple must be the same for all roles of the given subgraph. 
Role sequences that fulfill these conditions are said to be 
graphical. 



B. Network creation 

Given a graphical role sequence, a random network 
drawn from the ensemble of our proposed model can be 
generated in a straightforward fashion. The role indices 
for each vertex dictate the numbers of stubs of each type 
attached to the vertex and creation of the network in- 
volves working through each of the subgraph types in 
turn and repeatedly choosing a set of stubs at random in 
the appropriate combination for that subgraph and con- 
necting them together to make the subgraph in question. 
When all stubs for a given subgraph type have been ex- 
hausted we move on to the next subgraph until the list of 
subgraphs is also exhausted. The end result is a match- 
ing of stubs to form subgraphs, drawn uniformly from 
the set of all possible such matchings. 

It is possible in the creation of a given subgraph that 
two or more stubs attached to the same vertex will be 
drawn from the set of available stubs. If this happens 
the resulting network will contain either an edge that 
connects a vertex to itself — a self-edge — or multiple edges 
between the same pair of vertices — a multiedge — or both. 
In the model we propose, such edges are allowed to ex- 
ist, even though they are prohibited in many real- world 
networks. A similar situation arises in the standard con- 
figuration model of graph theory and, as in that model, 
the densities of self-edges and multiedges in our model 
both vanish in the limit of large network size as 1/n and 
hence can be neglected in this limit. It is possible to 
create models that generate "simple graphs," i.e., graphs 
without self-edges or multiedges, but such models are 
considerably more complicated to work with, both ana- 
lytically and computationally. 

In the calculations presented here, we will not assume 
that we are given a complete role sequence for all ver- 
tices, but only that we are given a role distribution p(d), 
which specifies the probability that a randomly chosen 
vertex has role vector d (or equivalently specifies the frac- 
tion of vertices having role vector d in the large-n limit). 
The network itself will then be defined by drawing a role 
sequence from this distribution and forming a random 
matching of the set of stubs specified by the sequence. 

The constraint that the role sequence must be graphi- 
cal imposes a corresponding constraint on the role distri- 
bution. Let {dr) = X^d drP{d-), where the sum is over all 
possible values of the role vector d. Then the expected 
number of rolc-r stubs in the entire network is n{dr) and 
the number of occurrences of the corresponding subgraph 
is n{dr) /rir, where rir is the number of times that role r 
appears in a single instance of the subgraph. We can 
calculate a corresponding figure for the number of occur- 
rences of the subgraph using any of its other roles and 
all of these figures must be equal if the role sequence is 
to be graphical. Thus we must have 



Assuming we have a role distribution p{d) that satisfies 
this constraint, the procedure for generating a network is 
first to draw a complete role sequence di , i — 1 . . .n for 
the network. It is possible that, despite the constraint 
on p{d), this sequence will not be graphical, because of 
statistical fluctuations in the role vectors drawn. If this is 
the case, we choose a vertex uniformly at random, discard 
its role vector and draw another from p{d). We repeat 
this process until the role sequence is graphical. Then 
the network itself is generated by random matching of 
stubs as described above. 

The role distribution also fixes the conventional degree 
distribution for the network. Each role r has some num- 
ber kr of associated edges, so a vertex with role indices 
di . . .dc has X)r krdr edges. The fraction of vertices with 
total degree k is then 



p{k) = Y,p{d)5(k,Y,kA 



{dr) _ ids) 
for all pairs r, s of roles within the same subgraph. 



(10) 



(11) 



While nothing in the above strictly demands it, we will 
here consider only sparse graphs for which the average 
total degree of a vertex (fc) is constant in the limit of 
large n. From Eq. (fTTj) we have 

(fc) = ^ fcp(fc) = ^ fc ^ p{d)5 [k^Y^Kdr) 
k fed r 

= ^kr^drPid) ^^kr{dr). (12) 

r d r 

IV. BIPARTITE GRAPH REPRESENTATION 

Like the edge-triangle model (of which it is a superset), 
this subgraph model is not in general locally tree-like but 
can be thought of as tree-like at a higher level. If one 
considers the subgraphs as coherent graph units in their 
own right, then the network is still tree-like at the level 
of these units. This idea can be made more precise as 
follows. 

Our subgraph model has an alternative representation 
as a bipartite graph: a network with two types of vertices 
and edges running only between unlike types. In this 
representation one of the types of vertices corresponds 
to the vertices of the original network while the other 
corresponds to the subgraphs, and each original vertex 
is connected by an edge to the subgraphs in which it 
participates. (In other circumstances this representation 
would be called a factor graph.) In order to distinguish 
the different roles that a vertex can play in a subgraph the 
edges can be labeled with the appropriate role numbers r. 

A vertex in the original graph having role vector d — 
(di, (^2, ■ • ■ , dc) is represented in the bipartite graph by a 
vertex with di incident edges marked with role label 1, 
d2 edges marked with label 2, and so forth. On the other 
side of the bipartite graph, every vertex representing a 
given subgraph in the original network has the same num- 
ber of stubs, again labeled by role, one for each vertex in 
the subgraph. 





FIG. 4: (a) A small network composed of a square subgraph 
and a triangle and (b) the bipartite representation of the same 
network. 



The process of creating a network in the original rep- 
resentation of the model is equivalent in the bipartite 
representation to a random matching of stubs subject to 
the constraint that every edge created joins a single ver- 
tex stub to a single subgraph stub having the same role 
label. Figure |4] illustrates the equivalence between the 
two representations. 

The important point to notice about this construction 
is that in the limit of large graph size the bipartite graph 
is strictly locally tree-like for the same reason that stan- 
dard random graphs are locally tree-like — the density of 
loops of any fixed size falls off as 1/n as n becomes large. 
Thus computations can be carried out on the bipartite 
graph using standard techniques for such networks [3] 
and the results can then be translated back into the lan- 
guage of the original network to calculate properties of 
the subgraph model. In the following sections we give 
a number of such calculations of increasing complexity. 
The first and simplest is the calculation of the expected 
number of occurrences of each subgraph in the subgraph 
model. 



A. Subgraph counts 



removal of that one vertex will disconnected them and 
hence they cannot have been biconnected. The locally 
tree-like property of the bipartite representation of the 
subgraph model, however, implies that the probability of 
two subgraphs of given size meeting at two places van- 
ishes in the limit of large graph size. Such subgraphs 
would form a loop of length four in the bipartite graph 
composed of the two subgraphs and the two vertices at 
which they meet. Since the bipartite graph is locally 
tree-like such loops do not occur (or more properly occur 
with vanishing density in the limit of large n). This im- 
plies that the density of biconnected subgraphs formed 
by the joining of others vanishes in our subgraph model 
and hence that if we wish to know about biconnected sub- 
graphs we need concern ourselves only with those added 
either explicitly to the network or implicitly as part of 
a larger subgraph. And the expected number of such 
subgraphs is trivial to calculate — we simply sum up the 
appropriate numbers. 

One important consequence of this result is that we 
can calculate the clustering coefficient of our network in 
a straightforward manner. The clustering coefficient is 
defined to be [3| 



C = 



3 X (number of triangles in network) 37Va 



(number of connected triples) 



N3 



(13) 

Since the triangle is a biconnected subgraph (in fact the 
smallest such subgraph), the number of triangles N/^ is 
given simply by counting how many triangles appear in 
each allowed subgraph, multiplying by the expected num- 
ber of the respective subgraphs, and summing over sub- 
graphs. The number of connected triples N3 is given by 



X! ( o)p(^) = 5"\ ^^rdr ~^krd 



(14) 



If we want to know how many times a particular sub- 
graph occurs in our subgraph model we must allow for 
three distinct ways in which a subgraph can be created: 
it can be added explicitly as one of the set of allowed sub- 
graphs; it can be added implicitly as part of a larger sub- 
graph; or it can be created by putting two or more sub- 
graphs together. For instance, a triangle can be formed 
if we explicitly add a triangle to the network, if we add 
a larger subgraph containing a triangle (such as the dia- 
mond subgraph in Fig.l^b), or we can add three indepen- 
dent single edges to the network that happen to coincide 
and create a triangle. 

There is an important distinction to be drawn here be- 
tween subgraphs that are biconnected and those that are 
only singly connected. Recall that a biconnected graph is 
one in which there exist at least two vertex-independent 
paths between any pair of vertices, or equivalently in 
which at least two vertices must be removed to discon- 
nect any pair of vertices. It is trivial to see that a bi- 
connected subgraph can only be created by joining two 
or more other subgraphs if the joined subgraphs meet at 
at least two places — if they meet at only one then the 



and hence we can evaluate Eq. (fT3|) . 

For singly connected subgraphs the situation is more 
complicated. Such subgraphs can be created by joining 
together two or more subgraphs at single vertices, which 
happens frequently even in locally tree-like networks. As 
a result it is not trivial to calculate the expected num- 
ber of singly connected subgraphs in a network, except 
in a few special cases. (The connected triple or 2-star of 
Eq. ()13|) is an example of a singly connected subgraph 
whose number can be calculated in a straightforward 
fashion, but this is the exception rather than the rule.) 



B. Giant component 

One can also exploit the bipartite representation of 
the subgraph model to find the size of the giant com- 
ponent in the model. As with the treatment of the 
edge-triangle model in Section |TT1 we introduce excess 
degree distributions, but defined now in terms of the bi- 
partite network. Consider a subgraph node in the bi- 
partite network and imagine following one of the edges 



connected to it, an edge labeled with role r, to the ver- 
tex at its other end. Then, if that vertex has overall 
role indices di,. . . ,dr + l, ■ ■ ■ ,dc and we exclude the edge 
along which we arrived, it will have "excess" role indices 
di, . . . ,dr, ■ ■ ■ ,dc. Moreover, since the probability of our 
arriving at a vertex with role index dr is proportional 
to dr the excess role indices are distributed according to 
the probability distribution 

'2''~(^) = / , \ p{di,...,dr + 1,...,4), (15) 

{dr) 

and there is a corresponding distribution for every other 
role. 

As in the edge-triangle model it is convenient to keep 
track of the various role distributions using their gener- 
ating functions, defined by 



Go(z) = 5]p(d)z^ 

d 

Gr{z)=J2<lrid)zf' 



,rfc 



(16) 

(17) 



Note that given the generating function Go{z) we can 
calculate the generating function H{z) for the conven- 
tional degree distribution in a straightforward manner 
using Eq. (ITT|) thus: 

H{Z) - 5]p(fc) ^' = E Ep(d) ^(fc, E krdr 

k 

= Ep(d)^'^'^ 



k d 



= Go(z'=S...,z'==), 



(18) 



where kr is, as previously, the number of edges a vertex 
gains by virtue of playing role r in a subgraph. 

We also need to define generating functions for the role 
distributions on the other side of the bipartite graph, 
the side that represents the subgraphs. These generating 
functions, however, take a relatively simple form, since 
every role belongs to only one subgraph and every in- 
stance of a given subgraph contains the same distribution 
of roles. Let us define rir as before to be the number of 
times role r occurs in its own subgraph, and let us give 
the subgraphs unique labels such that gr is the label of 
the graph in which role r appears. Then the generating 
function for the excess role distribution of a subgraph 
node reached by following an edge representing role r is 



F,(z) 



^ri 



nsS„ 



(19) 



where 5ij is the Kronecker delta. (We can in principle 
write down a generating function for the normal (non- 
excess) role distribution of the subgraph nodes, but we 
omit it because it's not needed for our calculations.) 

Armed with these generating functions, we can com- 
pute the size of the giant component in the network as 



follows ^3| . Define Ur to be the probability that a vertex 
is not connected to the giant component as a result of its 
playing role r in a given subgraph. This occurs if and only 
if none of the other vertices in that subgraph, regardless 
of role, are themselves members of the giant component, 
or, in the language of the bipartite representation, if none 
of the other edges connected to the appropriate subgraph 
node lead to vertices that are in the giant component. 

Similarly, let Vr be the probability that a vertex that 
plays role r in a particular subgraph is not a member of 
the giant component by virtue of any of the other roles 
it plays. In the language of the bipartite representation, 
none of the other edges incident on such a vertex connect 
it to the giant component. 

In terms of these variables we have 



Ur = — \\ V 

s—l 



risSn 



-n«"""-'° =Frivi,...,Vc), (20) 

s—l 

c 

= E 9-(d) n "'" = ^-^("1' • • ■ ' "-)' (21) 



or, in vector notation, u = F(v) and v = G(u). Being 
primarily concerned with u, we can also eliminate v and 
write u as the solution of the fixed point equation 



u-F(G(u)). 



(22) 



Then the probability that a vertex with role indices 
di . . .dc is not in the giant component is Yir '^f^ with dr 
distributed according to p(d), so that the average prob- 
ability is 



Y^p{A)\{uf^G,{n), 



(23) 



and the average probability that a vertex is in the giant 
component is one minus this quantity: 



S'=l-Go(u). 



(24) 



Between them, Eqs. ([^ and ([M|) give us a complete 
solution for the size of the giant component. As with the 
edge-triangle model the equations are often not solvable 
in closed form, but can be solved numerically by simple 
iteration starting from a suitable initial value for u. 

As a first example, consider again the edge-triangle 
model, for which there are two subgraphs, the edge and 
the triangle, with one role each and generating functions 



Fi{zi,Z2) ^ Zi, F2{zi,Z2)=zl 



(25) 



The excess degree generating functions for the vertices 
are 

Gi{zi,Z2) = -^ E(^i + iX'^i + l,c^2)^f^;^2^ (26) 
(rfi) Y 

G2{zi,Z2) = 7^E('^2 + l)pidi,d2 + l)ztzi\ (27) 



and Eqs. (|22p and ([241) reduce to the two equations 

Ui=Gi{ui,U2), U2= [G2{ui,U2)] , (28) 

which are identical to Eqs. (I7|) and (|5]). And the size of 
the giant component as a fraction of the size of the whole 
network is given by S" = 1 — Go(mi, U2) as before. 

As a more complicated example consider a network 
built from the subgraphs shown in Fig. ^^d: single edges, 
triangles, and diamonds. Of the four roles let us label the 
ends of single edges role 1, the corners of the triangles 
role 2, and the two roles in the diamond roles 3 and 4 
(which is which will not matter). Now consider the role 
distribution p(d) = Pi(di)p2(rf2)p34('i3, ^4) where 



Pi{di) 
^34(^3,^4) 



di> d2 ' 

(1 - 2a)Srsfl6ri^o 



(29) 

(30) 



In other words, participation is Poisson distributed with 
means Ci and C2 for roles 1 and 2, and vertices participate 
either in one diamond with equal probability a of taking 
role 3 or 4, or in no diamonds with probability 1 — 2a. 

This particular distribution is chosen because it has a 
nontrivial but still relatively simple solution: after some 
work it can be shown that the size S of the giant com- 
ponent obeys 

S=l-{1- 2a)e-^^^-^^^(2-s) _ 2ae-*-^s-4c,si2-s) ^ 

(31) 
We show the form of this solution as a function for a for 
one particular choice of parameters in Fig. [S] 



C. Position of the phase transition 

As with other random graph models, the fixed point 
equation, Eq. ([22|) . has a trivial solution at Ur = 1 for 
all r corresponding to the state in which there is no giant 
component in the network and this fixed point undergoes 
a transcritical bifurcation at the point at which a giant 
component appears. We can locate the bifurcation, and 
hence the appearance of the giant component, by linear 
stability analysis of the fixed point. We write Ur = 1 — er 
and expand to first order in the small quantities Cr , which 
gives e = Me where M is the ex c matrix with elements 



dGt 



Mrs = J2i^t^9rgt - ^rt)jrr- 



dZs 



(32) 



z=l 



Using the definition of Gr{'z) given in Eq. p7|) we can 
show that 



dGt 

dzs 



z=l 



(dt) 



(33) 



and hence that 



Mrs = S, 



rs ^rs 



{drds 



^sSgrg, + J2 -^lT-'T't^9rgt ' (34) 




FIG. 5: The size S of the giant component in the network of 
edges, triangles, and diamonds described in the text, with the 
average role indices for edges and triangles fixed at (di) = i 
and (^2) ~ i, plotted as a function of the parameter a that 
controls the two role indices for the diamonds (see Eq. pO[l ). 
The solid line represents the analytic result and the circles 
are simulation results averaged over 100 networks with 10^ 
vertices each. 



Physically the matrix element Mrs measures a branch- 
ing ratio for the locally tree-like bipartite graph: a ver- 
tex that plays role r shares the relevant subgraph with 
some number of other vertices and Mrs measures the av- 
erage number of times those other vertices collectively 
play role s in further subgraphs. 

Now consider a set of randomly chosen vertices in a 
large network and suppose we grow that set repeatedly 
by adding to it all vertices with which its members share 
a subgraph. If we focus on the boundary of the set — 
meaning those vertices added on the most recent step — 
and represent the number of times the boundary vertices 
play roles 1 to c by a c-component vector, then the ex- 
pected value of the vector is multiplied on each step by 
one factor of M. If the sum of the vector elements grows 
we have a giant component and if it dwindles to zero, 
so that the set stops growing, then we do not, meaning 
that we have a giant component if and only if at least 
one eigenvalue of M is greater than one. If all eigenval- 
ues are less than one then there is no giant component, 
and if one or more eigenvalues are exactly equal to one 
and none are greater then we are precisely at the phase 
transition point at which the giant component appears. 

In general the eigenvalues of M are not trivial to 
find, but in some cases the problem simplifies. Con- 
sider, for instance, the case in which the role indices dr 
are independent and Poisson distributed, in which case 

(drds) = {dr){ds) + {dr)Srs SO that 

Mrs = iNr-l){ds), (35) 

where Nr = ^^ ntSg^g^ is the number of vertices in the 



subgraph in which role r appears. As an outer product of 
two vectors, this matrix is defective, having only a single 
eigenvector with corresponding eigenvalue A = Ylri^r — 
l){dr). Thus in this network a giant component exists if 
and only if 



J2{Nr-l){dr 



> 1. 



(36) 



In simple language, this equation says that a giant com- 
ponent exist if and only if the average number of vertices 
with which a random vertex shares a subgraph is greater 
than one. 

Note that the degree kr of role r within its subgraph 
trivially is never greater than N^ — 1 and hence when we 
are precisely at the transition point at which the giant 
component forms the average degree, Eq. P^ . satisfies 

(fc) = Y, kr{dr) < J2^Nr - l){dr) = 1. (37) 



Recall that in the standard random graph of Erdos and 
Renyi the transition occurs precisely at (fc) = 1 and 
Eq. p7l) thus tells us, in some sense, that the transi- 
tion happens "earlier" in this subgraph model, or at least 
no later — in general a giant component will form when 
the average degree is less than one. The physical insight 
behind this observation is that belonging to a subgraph 
guarantees a vertex connections to all the other vertices 
in that subgraph, which vertices may be significantly 
greater in number than merely the immediate neighbors 
of the first vertex. 

Another simple solvable case is that of a network com- 
posed of two types of subgraph, each with a single role. 
The edge-triangle model of Section |IT] is a special case of 
such a model, but the general case is also solvable. The 
matrix M takes the form 



M = 



(ni - 1) 



(dl) - (di) 



V 



(712 - 1 



(dl) 

{did2) 
{d2) 



(m - 1) 



(^1^2) \ 



("2-1 



{dl) {d2 
{d2) J 



(38) 

and the largest eigenvalue is A = 5 (^ + \^t'^ — 4A) where 
r and A are the trace and determinant of the matrix 
respectively. There is a giant component if and only 
if this eigenvalue is greater than one, or equivalently if 
\/t^ — 4A > 2 — r. This condition is satisfied if either 
T>2orT>A + l. In terms of the matrix elements, 
these inequalities can be written 



{dl) 



{d2 



and 



{did2)^ 


\{dl) 
[{dl) 


ni 


\{dl) 
[{d2) 


n2 


{di){d2) 


ni — 1 


n2 — 1 



7TT —r >0- (40) 



Note that Eq. (|39p does not depend on the correlation 
term (^1^2) between the two role indices: in physical 
terms it says simply that there is a giant component if the 
densities of the two subgraphs independently are enough 
to create one. A giant component can, however, also 
be created by correlations between the placement of the 
subgraphs even when the overall density of subgraphs 
would otherwise be inadequate, and giant components of 
this kind are described by Eq. pO|) . Thus, for instance, if 
the vertices with many of one subgraph also have many 
of the other, then these high-degree vertices may join up 
to form a giant component even if there would be none 
were the same subgraphs allocated to different vertices. 
In the examples we have examined, the size of the gi- 
ant component and the position of the phase transition 
are determined solely by the role distribution p{d) and 
the numbers rir of roles in their respective subgraphs. 
The actual shape of the subgraphs themselves does not 
matter — once one vertex in a subgraph is in the giant 
component, the rest automatically are as well, regardless 
of patterns of connection within the subgraph. Thus, for 
instance, a network composed of a given distribution of 
cliques of given sizes would have a giant component of 
the same size as a network composed of the same dis- 
tribution of loops of the same sizes. Not all network 
properties, however, show this behavior. In the next sec- 
tion we look at percolation on our networks, for which 
the shapes of the subgraphs matter greatly. 



D. Percolation 

Site and bond percolation have important implications 
for network resilience [2J, [23 and the behavior of spread- 
ing processes on networks |26l428l |. In site percolation, 
each vertex of a graph is present, functional, or "occu- 
pied" with independent probability ^g. In bond perco- 
lation each edge is similarly occupied with independent 
probability (j)b- 

It turns out that percolation on networks generated by 
our subgraph model can be treated using the same ma- 
chinery developed above for calculating properties of the 
giant component. The only thing that needs to change 
is the generating function Fr{z), Eq. ([T^. for the proba- 
bility distribution of the number of vertices of each role 
that a vertex of role r can reach in its subgraph. In our 
previous treatment this distribution took a particularly 
simple form, since a vertex of role r could by definition 
reach all other vertices in its subgraph. Once we intro- 
duce percolation, however, some vertices in the subgraph 
may become unreachable, because there is no path of 
occupied vertices or edges along which to travel. 

We will here consider the general case of simultaneous 
site and bond percolation — both vertices and edges may 
be occupied (or not) with probabilities (ps and (pb (or 
1 — (ps and 1 — (pb). Percolation on vertices or edges 
alone, i.e., conventional site or bond percolation, is the 
special case of this more general process when either 0^ 
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or (j)b equals one. 

Let us denote by Pr{4>si4>b]^) the probability that an 
occupied vertex of role r in a subgraph is connected via 
occupied vertices and edges io di ... dc occupied vertices 
of roles 1 ... c in the same subgraph, given the values (ps 
and (pb of the occupation probabilities. Then our gener- 
ating function F^ is defined by 



h;^) = '^Pr{(l)s,(l)b;d)W_. 



(41) 



i5 = l 



In terms of this function, the size of the giant percolation 
cluster is given by 



u = r(G(u)), s^4>s[i-Go{m)]. 



(42) 



(See Eqs. p2|) and (|24)) for the corresponding equations 
in our earlier calculation.) The additional factor of ps in 
the second equation accounts for the fact that a vertex 
can only be in the giant component if it is itself occupied. 
All we need now to complete the calculation is the form 
of F^. 

As an example, consider pure site percolation on a sub- 
graph that takes the form of an m-clique — a set of m 
vertices with an edge between every pair. The vertices in 
a clique have only one role, and an occupied vertex i can 
reach another vertex j if and only if j is itself occupied. 
Thus the distribution pr is binomial in this case: 



Pr{(f>;d) 



TO — 1 
d 



(1 



(43) 




FIG. 6: The size S of the giant percolation cluster in the 
network of chques described in the text, with chques of size 
2 to 5, as a function of the site percolation probability (p. 
The circles, triangles, and squares show simulation results 
averaged over 100 networks with 10^ vertices each for (k) = 2, 
4, and 8 respectively, and the solid lines are the corresponding 
analytic results. If role r = m is the role played by vertices 
in cliques of size m then the average role indices are (dm) = 
j{k)/{rn — 1) (which means that a randomly chosen edge is 
equally likely to belong to a clique of any size). 



and 



Fr 



E 

d=0 



= {\-ct> + 4>zy 



cP\l-^) 



m—l — d d 



(44) 



where (j> is the vertex occupation probability. 

Suppose, for instance, that we have a network made 
entirely of cliques of various sizes m having one role each 
labeled with role labels r — m, and suppose the corre- 
sponding role indices dm are independently Poisson dis- 
tributed at each vertex. Then Gm(d) = Go(d) for all m 
and 



^[l~P + ^Go{u)Y 



S = P[l - Go{u)]. (45) 



Thus in this case we have Um = (1 ^ 5')'" ^ for all to and 
S satisfies the self-consistent equation 

S ^ <j>[l - Go{l - S,{1 - Sf,{l - S)\ . . .)] 
= <j>[l-Hil-S)], (46) 

where H{z) is the generating function for the ordinary 
degree distribution, defined in Eq. ((T5)) . The percolation 
transition in this network corresponds to the point where 
the gradient of 0[1 — H{1 — S)] at S = is 1, i.e., the 
point 



1 



HUD 



1 



(47) 



Note that this is the same condition as for the percolation 
point on an ordinary Poisson random graph, although the 
network is quite different. In Fig. [5] we show a plot of S 
from Eq. (j46p , along with the results of numerical simu- 
lations of finite sized networks in this class. As the figure 
shows, the agreement between simulation and theory is 
excellent. 

The computation of Fr for general subgraphs is typi- 
cally more involved than for the simple clique, since one 
must take into account possible paths between vertices 
and allow for different roles. Nonetheless, the compu- 
tations are in principle merely a matter of counting the 
number of reachable vertices of each role for each possible 
configuration of occupied vertices or edges and multiply- 
ing by the probability of the configuration. We give in 
Table U the generating functions for all roles of all bicon- 
nected subgraphs of four vertices or fewer (plus the single 
edge) , calculated by exhaustive enumeration in this fash- 
ion. Generating functions for singly connected subgraphs 
are simple composites of the biconnected ones. 

If we only wish to determine the position of the perco- 
lation transition, the full subgraph generating functions 
are not necessary. Linearizing around the trivial fixed 
point in the equations for u we again derive an equa- 
tion e — Me, and the existence of a giant component 
depends on whether the matrix M has any eigenvalues 
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Subgra 


ph 


Role 


Generating function Fr 


• — • 


• 


1 - (l)s4>b + 4>s(t>bz 


A 


• 


(1 - 0,<^fc)2 + 20,(/)b(l - (/),0fc(2 - (j)b))z + fyli^ - 2(j)b)z^ 


i 
i 


* i 

* i 




• 


(1 - <PsCl>b)^ + 20.(/.b(l - 0.0fc)2z + 302^2(1 _ ^^^^(2 - (/.b))z2 + 03^3(4 - S./)^)^^ 


N 




1 
• 

2 
O 


(1 - C^s(f>b? + 2<j>sM^ - (/.,0fc)(l - ^sM'^ - M)Z2 + 4'14'lii - 2(/'6)(l - 0s</'6(2 - 
0b))z| + 2(/)2</,2(l - 3(/.,0b + 3(/),02 _ 0^^3) + ^3^3(8 _ 11^^ + 402)^1^2 

(1 - 0,0fc)3 + 2<\,,(^b{\ - (/-s-^h)!! - 0.</'6(2 - (/)6))zi + 020g(l - 3(/),(/)6 + 3(/.,<^g - 


< 
< 


X 




• 


(1 - 0s</)h)3 + Z4>s^h{\ - ^sM^ - <f>b)?Z + 30202(3 _ 20fc)(l - 3(f>s^b + 'i(t)s4>l - 

(f>sc^l)z^ + (j)l<j>liW - 33(j)b + 2#2 _ 603)z3 



TABLE I: Subgraph generating functions Fr for all biconnected subgraphs of four vertices or fewer. 



greater than one. In this case M is given by 



Mrs--J2 



dFr dGt 



dzt dzs 



(48) 



The derivative [dFr/dzt\z=i is equal to the mean number 
of vertices of role t reachable from a vertex of role r within 
the appropriate subgraph. It is an open question vi^hether 
there exists a method for calculating this mean number 
faster than the exhaustive enumeration of states used to 
calculate the generating function itself. 



V. CONCLUSION 

In this paper we have proposed and analyzed a ran- 
dom graph model that incorporates an arbitrary distribu- 
tion of any chosen set of subgraphs or motifs, and hence 
mimics the properties of real-world networks, which are 
observed in many cases to contain certain subgraphs in 
significant numbers. The model is easily treated numer- 
ically and many of its properties can be calculated ana- 
lytically by virtue of a mapping to a locally tree-like bi- 
partite graph. In particular, we have given calculations 
of subgraph counts, the size of the giant component, the 
position of the transition at which the giant component 
appears, and percolation properties for site and bond per- 
colation. 

Useful though the model may be, however, it leaves 
open some important questions. We have not considered. 



for instance, how one should select the set of subgraphs 
to be used. If we wish to model a particular real-world 
network, then we would presumably want to look at the 
subgraphs that appear in that network and mimic their 
density and placement as accurately as possible in the 
model. But in that case, which subgraphs should be con- 
sidered to occur sufficiently frequently as to require their 
inclusion in the model? And what should the distribu- 
tion of the various subgraphs be? For biconnected sub- 
graphs the density in the model network is, as we have 
shown, simply the density of the subgraphs we add ex- 
plicitly, since those created by the combination of other 
subgraphs have density zero. For singly connected sub- 
graphs, however, the density is more complicated, con- 
taining as it does not only the subgraphs we add ourselves 
but also those made from other subgraphs, and at present 
we do not have a good way to calculate it, making the 
matching of the real and model networks a challenge. 

Some further generalizations of the model are possible. 
One could consider subgraphs in which the stubs are la- 
beled, for instance with colors, so that even vertices in the 
same role could be distinguished. By specifying the col- 
ors of stubs connected to each vertex as well as the roles 
one could then induce additional kinds of structure, such 
as bipartite or fc-partite structure or assortative mixing. 
One could also include role-role correlations, by analogy 
with the degree-degree correlations studied in ordinary 
random graphs. 

The treatment given in this paper also leaves open 
some mathematical questions. In particular, it is unclear 
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whether there is a quicker way to calculate the crucial 
subgraph generating functions of Section IIVDI than by 
exhaustive enumeration of states. For certain families of 
graphs, such as cliques, we have shown that it is possi- 
ble to characterize the generating functions analytically, 
but it seems unlikely that this will be possible in more 
general cases. It is possible, however, that one could find 
a numerical algorithm for calculating the coefficients of 
the generating functions more quickly than the current 
method, which is exponentially slow. 
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